< 02 - Character Building | Home | 04 - Time and Chronology >
By now you should have a decent understanding of how bookworm assembles a list of character relationships and assesses their strength.
The real point of this project, though, is to give the user a tactile, intuitive view of the network of characters and how they interact. This notebook should cover the methods I've used to achieve that.
Let's start by importing all the usual stuff and loading in the Harry Potter network:
In [5]:
from bookworm import *
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12,9)
import pandas as pd
import numpy as np
In [7]:
book = load_book('data/raw/hp_philosophers_stone.txt')
characters = extract_character_names(book)
sequences = get_sentence_sequences(book)
df = find_connections(sequences, characters)
cooccurence = calculate_cooccurence(df)
In [8]:
import networkx as nx
interaction_df = get_interaction_df(cooccurence, threshold=2)
interaction_df.sample(5)
Out[8]:
get_interaction_df() is defined in bookworm/build_network.py, and works by searching through the provided co-occurrence matrix for interactions with strength above a specified threshold.
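To make that thresholding step concrete, here is a rough sketch of the idea in plain Python. It is not bookworm's actual implementation: the toy counts, the function name interactions_above and the choice of a strict "greater than" comparison are all assumptions made for illustration.

```python
# Hypothetical toy data: symmetric co-occurrence counts between three characters.
cooccurrence = {
    "Harry": {"Harry": 0, "Ron": 5, "Vernon": 1},
    "Ron": {"Harry": 5, "Ron": 0, "Vernon": 0},
    "Vernon": {"Harry": 1, "Ron": 0, "Vernon": 0},
}


def interactions_above(matrix, threshold):
    """Return one (source, target, value) tuple per pair whose count exceeds threshold."""
    names = sorted(matrix)
    edges = []
    for i, source in enumerate(names):
        # Only look at the upper triangle so each pair is reported once.
        for target in names[i + 1:]:
            value = matrix[source][target]
            if value > threshold:  # assumption: strictly greater than the threshold
                edges.append((source, target, value))
    return edges


edges = interactions_above(cooccurrence, threshold=2)
print(edges)
```

Raising the threshold prunes weak, possibly accidental, co-occurrences, at the cost of dropping minor characters from the network altogether.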
We can load that interaction dataframe into a NetworkX Graph using the super simple from_pandas_dataframe() function (renamed from_pandas_edgelist() in later NetworkX releases):
In [9]:
G = nx.from_pandas_dataframe(interaction_df,
                             source='source',
                             target='target')
And, just as easily, visualise it with draw_spring(), where 'spring' refers to the idea that edges in the network are treated like physical springs, with elasticity/compressibility related to the weights of the connections:
In [10]:
nx.draw_spring(G, with_labels=True)
Very nice... ish. There's more that could be done to clean up the visualisation and make it pretty, but it's fine for now.
One of the nicest things about NetworkX is all of its builtin network analysis functionality. For example, we can use pagerank or hits to give us the most 'important' or 'central' nodes in the network. These algorithms were originally developed to analyse linked networks of websites, but they can just as easily be applied to stations in transport networks, streets in cities, similar products in ecommerce systems, friends in social circles, or connected characters in books.
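For a feel of what pagerank is doing under the hood, here is a minimal power-iteration sketch in plain Python. It is a simplified illustration rather than NetworkX's implementation: the toy graph is invented, and the function assumes every node has at least one neighbour.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over an adjacency dict."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of the total score...
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        # ...and each node splits the rest of its score evenly among its neighbours.
        for node, neighbours in adjacency.items():
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Hypothetical toy network, not taken from the book's data.
toy_network = {
    "Harry": ["Ron", "Hermione", "Hagrid"],
    "Ron": ["Harry", "Hermione"],
    "Hermione": ["Harry", "Ron"],
    "Hagrid": ["Harry"],
}

scores = pagerank(toy_network)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

The best connected node ends up with the highest score, which is the same intuition we rely on when ranking characters.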
In [11]:
pd.Series(nx.pagerank(G)).sort_values(ascending=False)[:5]
Out[11]:
In [12]:
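# nx.hits returns (hubs, authorities); on an undirected graph the two scores coincide
hubs, authorities = nx.hits(G)
pd.Series(hubs).sort_values(ascending=False)[:5]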
Out[12]:
We can ask NetworkX for cliques in the graph, which are especially relevant to social networks like this. enumerate_all_cliques() gives us a massive list of all the cliques it finds, ordered by size. We'll just return the last one (one of the largest) because it's most illustrative of what a clique is in this context...
In [13]:
list(nx.enumerate_all_cliques(G))[-1]
Out[13]:
It's isolated the people who appear in the book at Number 4, Privet Drive. Fun!
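If the idea of a clique is new, the sketch below might help: a clique is a group of nodes in which everyone is directly connected to everyone else. This is a basic Bron-Kerbosch search for maximal cliques, written in plain Python on an invented toy network. It is not what NetworkX runs internally, and unlike enumerate_all_cliques() it only reports cliques that can't be extended any further.

```python
def maximal_cliques(adjacency):
    """Find all maximal cliques with the basic Bron-Kerbosch algorithm."""
    cliques = []

    def expand(current, candidates, excluded):
        # Nothing left to add and nothing already covered: current is maximal.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            # node has been fully explored, so move it from candidates to excluded.
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques, key=len)


# Hypothetical toy network: a tightly knit household plus a couple of outsiders.
toy_network = {
    "Vernon": {"Petunia", "Dudley"},
    "Petunia": {"Vernon", "Dudley"},
    "Dudley": {"Vernon", "Petunia", "Harry"},
    "Harry": {"Dudley", "Ron"},
    "Ron": {"Harry"},
}

cliques = maximal_cliques(toy_network)
print(cliques[-1])  # the largest clique comes last
```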
We can also look at the communicability of one character with another. We would expect characters who don't spend much time together in the book to have a harder time communicating with one another than those who spend a lot of time together, and that shows up as a smaller communicability value:
In [14]:
comms = nx.communicability(G)
print(comms["('Vernon ',)"]["('Dumbledore ',)"])
print(comms["('Harry ',)"]["('Hermione ',)"])
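Communicability between two nodes is commonly defined as the corresponding entry of the matrix exponential of the adjacency matrix: it counts walks of every length between the pair, with a walk of length k weighted by 1/k!. The sketch below approximates that with a truncated series in plain Python on an invented four-node chain. It is only meant to show why distant nodes score lower, and it is not how NetworkX computes the values.

```python
def communicability(adjacency_matrix, terms=30):
    """Approximate exp(A) with the first `terms` terms of its Taylor series."""
    n = len(adjacency_matrix)
    # term holds A^k / k!, starting from the identity matrix (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        # Multiply the previous term by A and divide by k to get A^k / k!.
        term = [[sum(term[i][m] * adjacency_matrix[m][j] for m in range(n)) / k
                 for j in range(n)]
                for i in range(n)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total


# Hypothetical chain of four characters: 0 - 1 - 2 - 3
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

comm = communicability(chain)
print(comm[0][1])  # direct neighbours
print(comm[0][3])  # opposite ends of the chain
```

The value for the two ends of the chain comes out smaller than the value for direct neighbours, because only longer, more heavily discounted walks connect them.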
Similarly, we can use NetworkX's implementation of classic pathfinding algorithms like Dijkstra's algorithm and A* to return paths between characters. For example, if Hedwig was interested in getting to know Nicolas Flamel, and wanted to do so with as few new introductions as possible along the way, these are the shoulders she would need to tap on for introductions:
In [15]:
nx.dijkstra_path(G,
                 source="('Hedwig ',)",
                 target="('Flamel ',)")
Out[15]:
Pathfinding is clearly an application that is better suited to transport networks and the like, but it's still interesting to see it applied here...
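When every edge counts the same, finding the path with the fewest introductions boils down to a breadth-first search. The sketch below shows that idea in plain Python on an invented toy network. The connections are made up for illustration and are not the ones found in the book.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search: returns the path with the fewest hops, or None."""
    previous = {source: None}  # remembers how we first reached each node
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical toy network of who knows whom.
toy_network = {
    "Hedwig": ["Harry"],
    "Harry": ["Hedwig", "Hagrid", "Ron"],
    "Ron": ["Harry"],
    "Hagrid": ["Harry", "Dumbledore"],
    "Dumbledore": ["Hagrid", "Flamel"],
    "Flamel": ["Dumbledore"],
}

print(shortest_path(toy_network, "Hedwig", "Flamel"))
```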
There's an anecdote which gets passed around about a young South Korean computer scientist in academia who wanted to rise to the top of his field as quickly as possible. By developing a network of the academics in his field and the people they had published with, he was able to quickly work out which authors were most influential, and the path of introductions and coauthorship that he would need to take from his own, weak position in the network to publishing papers with the most influential academics and becoming a central node himself. I have no idea whether the anecdote is true or not, but it's a nice story, and illustrative of where and why this stuff might be useful to think about. Applying it to owls and alchemists is fun, but it can be useful in the real world too...
All of this stuff dates back to the 1730s and the origins of graph theory, with Euler and the Seven Bridges of Königsberg. It's a subject worth reading about if you haven't already. It's fascinating, and the world opens up to you in entirely new ways when you develop some intuition around when and where networks appear in nature and how they can be analysed. Clever applications of graph theory are absolutely key to the success of companies like Google, Facebook, and Amazon.
The thing above is fast and fun, and allows us to run a load of interesting algorithms over the network, but it all feels very static... The point of this project is to visualise these networks in a way which gives the user an intuitive sense of the relationships between characters.
We can get closer to that intuitive, touchy-feely sense of the network by putting together a force directed graph with d3.js, like the one by Mike Bostock (the creator of d3) shown here. Bostock is visualising the boring old Les Mis dataset. We're going to feed d3 our freshly made Harry Potter one.
First we need to set up the data structure which the d3 script requires.
In [35]:
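node_ids = set(interaction_df['source']) | set(interaction_df['target'])  # include characters that only appear as targets
nodes = [{"id": str(node_id), "group": 1} for node_id in node_ids]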
links = interaction_df.to_dict(orient='records')
d3_dict = {'nodes': nodes, 'links': links}
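For reference, here is a small, self-contained sketch of the node-link shape that force directed d3 examples typically consume, along with a sanity check that every link points at a known node. The records are invented, and the 'value' key is an assumption borrowed from the Les Mis example rather than something confirmed about interaction_df.

```python
import json

# Hypothetical interaction records, one dict per edge.
records = [
    {"source": "Harry", "target": "Ron", "value": 5},
    {"source": "Harry", "target": "Hedwig", "value": 3},
]


def to_node_link(records):
    """Build a {'nodes': [...], 'links': [...]} dict from edge records."""
    # Take ids from both columns so characters that only appear as targets are kept.
    names = sorted({r["source"] for r in records} | {r["target"] for r in records})
    nodes = [{"id": name, "group": 1} for name in names]
    return {"nodes": nodes, "links": list(records)}


graph = to_node_link(records)

# A link that refers to an id missing from 'nodes' would break the visualisation.
node_ids = {node["id"] for node in graph["nodes"]}
dangling = [link for link in graph["links"]
            if link["source"] not in node_ids or link["target"] not in node_ids]

print(json.dumps(graph, indent=2))
print("dangling links:", dangling)
```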
We can write that dictionary out to a .json file in the project's d3 directory using the json package:
In [36]:
import json
with open('bookworm/d3/bookworm.json', 'w') as fp:
    json.dump(d3_dict, fp)
Jupyter notebooks allow us to run commands in other languages, so we'll use bash to do a few things from here on. For example, we can list the files in the d3 directory:
In [37]:
%%bash
ls bookworm/d3/
or print out one of those files:
In [38]:
%%bash
cat bookworm/d3/index.html
The next cell can be used to set up a locally hosted version of that d3.js script.
It's a super-simple, two-line bash script which uses python's builtin http.server module to serve the javascript visualisation code to the browser on your machine.
We dumped our graph data into a file called 'bookworm.json' in one of the cells above. That file can now be processed by 'index.html' (printed above), which displays the data using the d3.js javascript library.
In [39]:
%%bash
cd bookworm/d3/
python -m http.server
When you've run the cell, open a new tab and go to the following address:
localhost:8000
You should see a pretty graph representation of our network bouncing around. Hover over a node to see which character it corresponds to. Click and drag nodes to play around with it (this is super fun to do with your hands if you're running this on a touchscreen laptop, and playing with two hands also works!).
When you're finished playing, remember to navigate back to the two-line %%bash cell above and push the STOP button to kill the local server. You won't be able to run any more code in this notebook until you do.
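If blocking the notebook is a nuisance, one possible alternative is to start the same kind of server from Python in a background thread. The sketch below uses only the standard library (Python 3.7 or later). The function name serve_in_background is made up for this example, and it binds to 127.0.0.1 only, which is a narrower default than python -m http.server uses.

```python
import functools
import http.server
import threading


def serve_in_background(directory, port=8000):
    """Serve `directory` over HTTP from a daemon thread and return the server."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # The daemon thread keeps serving while the notebook stays free to run other cells.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Usage, assuming the same directory layout as the bash cell above:
#   server = serve_in_background('bookworm/d3')
#   ... open localhost:8000 in the browser and play with the graph ...
#   server.shutdown()      # stop serving
#   server.server_close()  # release the port
```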
In the next notebook, we'll start considering the effect of time in novels and ways of representing temporal networks.